Write a short description about the course and add a link to your github repository here. This is an R markdown (.Rmd) file so you can use R markdown syntax. See the ‘Useful links’ page in the mooc area (chapter 1) for instructions.
Describe the work you have done this week and summarize your learning.
Reading the data collected from students in 2014 into R
learning2014 <- read.table("C:/Data/IODS-project/learning2014.csv", sep=";", header=TRUE)
looking at the dimensions of the data (first value will be the number of rows (observations), second will be columns (variables))
dim(learning2014)
## [1] 166 7
looking at the structure of the data
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: num 0.037 0.031 0.025 0.035 0.037 0.038 0.035 0.029 0.038 0.021 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
Accessing the ggplot2 library for displaying plots of data
library(ggplot2)
drawing a scatter plot matrix of all the variables in the dataset
pairs(learning2014)
Drawing the same scatter plot matrix again, excluding the gender variable
pairs(learning2014[-1])
Coloring the points by the variable “gender”
pairs(learning2014[-1], col = learning2014$gender)
Accessing the GGally library to draw more advanced plots
library(GGally)
Creating a plot matrix with ggpairs, where we can see the distribution of each variable and its correlation with the others
p <- ggpairs(learning2014)
p
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As can be seen from the plot, most variables have a roughly normal distribution, except “Age”, whose distribution is positively skewed; the distribution of “Points” is also skewed, but negatively.
“Attitude” and “Points” exhibit the highest (positive) correlation, which is reasonable: the more positive your attitude toward statistics, the more likely you are to earn higher exam points.
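To make the skewness terminology concrete, the sample skewness can be computed by hand (a minimal synthetic sketch of my own, not part of the course data): a long right tail yields a positive value, a long left tail a negative one.

```r
# Sample skewness: mean of the cubed standardized values.
# Positive => right (positive) skew, as with Age; negative => left skew, as with Points.
skewness <- function(x) mean(((x - mean(x)) / sd(x))^3)

right_tailed <- c(1, 1, 2, 2, 3, 10)  # a long right tail
left_tailed  <- -right_tailed         # the mirror image
skewness(right_tailed) > 0            # TRUE
skewness(left_tailed)  < 0            # TRUE
```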
So, let’s now look at the scatter plot of Points versus Attitude
qplot(Attitude, Points, data = learning2014) + geom_smooth(method = "lm")
And, now let’s fit a linear model, and call it ‘my_model’
my_model <- lm(Points ~ Attitude, data = learning2014)
Now, let’s print out a summary of the model
summary(my_model)
##
## Call:
## lm(formula = Points ~ Attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.64 1.83 6.358 1.95e-09 ***
## Attitude 352.55 56.74 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The summary indicates that there is a statistically significant positive relationship between Attitude and Points, with a p-value under 0.001. The F-statistic of the model is also high enough.
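As a quick check on where the t value comes from: it is simply the estimate divided by its standard error. Plugging in the Attitude row from the summary above:

```r
# t value = Estimate / Std. Error (Attitude row of the summary above)
352.55 / 56.74  # ~6.21, matching the reported t value of 6.214 up to rounding
```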
Other variable pairs which display positive correlations are, e.g., “Attitude” and “deep learning”, and “strategic learning” and “Age”. Unsurprisingly, there is a negative correlation (-0.32) between “surface learning” and “deep learning”.
Creating a more advanced plot matrix with ggpairs
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap ("facethist", bins = 20)))
Drawing the new plot
p
This new plot shows differences between males and females in blueish and pinkish color, respectively. It also shows box plots of the variables for both males and females.
As an example, we can now see that the attitude of male students toward learning statistics is a bit more positive than that of their female classmates, hence the positive skewness of the distribution for males.
Creating a regression model with multiple explanatory variables (multiple regression). I use Points as the dependent variable (y), and Attitude, surface learning, and strategic learning as explanatory variables (X).
my_model2 <- lm(Points ~ Attitude + stra + surf, data = learning2014)
Printing out a summary of the model
summary(my_model2)
##
## Call:
## lm(formula = Points ~ Attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## Attitude 339.5221 57.4148 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
As the summary of the model indicates, only Attitude has a statistically significant relationship with Points.
Now, let’s remove surface learning from the model, and call it my_model3
my_model3 <- lm(Points ~ Attitude + stra, data = learning2014)
Let’s print out the new model
summary(my_model3)
##
## Call:
## lm(formula = Points ~ Attitude + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## Attitude 346.5810 56.5169 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
The model shows that still only Attitude is significant. However, the p-value of ‘stra’ is now smaller (0.089). This value is below 0.1, which is sometimes used as a threshold for statistical significance.
The model shows that “Attitude” correlates positively and strongly with “Points”. As explained earlier, this is reasonable, because those who have a more positive attitude toward learning statistics are more likely to receive higher points (in a statistics test, for example). There is also a positive correlation between strategic learning and the points students received. This makes sense, too: the more strategically oriented students are, the higher the probability of receiving a better grade. The correlation is weak, yet could be considered significant.
The R-squared of the model is ~0.20, which indicates that around 20 percent of the variance of the dependent variable (Points) is explained by the regression model. Whether or not this value is acceptable depends on, e.g., the nature of the study and the discipline. As a general rule, the higher the R-squared the better the model, but this is not always the case. The adjusted R-squared is the same as R-squared but adjusted for the number of predictors in the model.
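For concreteness, the adjustment can be reproduced by hand from the summary of my_model3 above (R² = 0.2048, n = 166 observations, p = 2 predictors):

```r
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
r2 <- 0.2048
n  <- 166
p  <- 2
1 - (1 - r2) * (n - 1) / (n - p - 1)  # ~0.1950, matching the reported 0.1951
```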
Producing diagnostic plots
plot(my_model3, which = c(1,2,5))
Now plots of Residuals vs. Fitted values (1), the Normal Q-Q plot (2), and Residuals vs. Leverage (5) have been produced.
Now, let’s place all the above graphs in the same figure.
par(mfrow = c(2,2))
plot(my_model3, which = c(1,2,5))
What do these plots indicate? Let’s first explore “Residuals vs Fitted values”. This plot is used to assess the linearity of the model, unequal error variances, and whether there are any outliers. Looking at the plot reveals that the residuals are scattered rather randomly around 0 and form a horizontal band, meaning that the relationship between the variables in our model is linear and the error variances are roughly constant. Only some values lie far from the rest (in the lower part of the plot); these could be considered outliers. How influential they are in our model will be determined by another plot, which we will look at soon.
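The “residuals centred on zero” behaviour can be verified numerically on a small synthetic example (my own sketch, not the course data): for any least-squares fit with an intercept, the residuals average exactly to zero and are uncorrelated with the fitted values.

```r
# Synthetic well-specified linear model: residual diagnostics behave as expected
set.seed(1)
x <- runif(100)
y <- 2 + 3 * x + rnorm(100, sd = 0.5)
fit <- lm(y ~ x)

mean(residuals(fit))              # ~0, by least-squares construction
cor(residuals(fit), fitted(fit))  # ~0: no trend against the fitted values
```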
Now let’s look at the Normal Q-Q plot. This plot helps us understand whether the errors of the model are normally distributed. If the points in the plot fall on a straight line, the normality assumption holds. The points in our plot fall roughly on a straight line, supporting the assumption of normally distributed errors.
Finally, let’s look at Residuals vs Leverage. This plot helps us identify influential outliers, i.e. observations that actually affect our model and that it might be better to remove. One way to spot them is to look at the Cook’s distance lines (dashed red lines): points falling beyond a high Cook’s distance are influential. In our plot, however, all points lie well within the Cook’s distance lines, hence there is no influential outlier in our data.
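Cook’s distances can also be inspected numerically via cooks.distance(); one common rule of thumb (my assumption here, cut-offs vary by textbook) flags observations with distance above 4/n as worth a closer look.

```r
# Numeric influence check on a small synthetic fit (not the course data)
set.seed(2)
x <- rnorm(50)
y <- x + rnorm(50)
fit <- lm(y ~ x)

d <- cooks.distance(fit)
which(d > 4 / length(d))  # indices of potentially influential observations
```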
The diagnostic plots explained above support the validity of our statistical model; that is, the model is linear, the errors are approximately normally distributed, and there are no major influential outliers.
In this exercise, I am going to perform logistic regression to see the effect of different variables on students’ alcohol consumption. I have combined the Math and Portuguese datasets into one, called ‘alc’, which includes data on students in both mathematics and Portuguese classes. First, let us load some of the libraries we need.
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:tidyr':
##
## extract
alc <- read.table("C:/Data/IODS-project/alc.csv", sep=",", header=TRUE)
This dataset contains information on students’ achievement in two classes, mathematics and Portuguese language. The aim is to see, e.g., how alcohol consumption and gender affect students’ performance. The names of the variables in the dataset can be seen here:
colnames(alc)
## [1] "X" "school" "sex" "age" "address"
## [6] "famsize" "Pstatus" "Medu" "Fedu" "Mjob"
## [11] "Fjob" "reason" "nursery" "internet" "guardian"
## [16] "traveltime" "studytime" "failures" "schoolsup" "famsup"
## [21] "paid" "activities" "higher" "romantic" "famrel"
## [26] "freetime" "goout" "Dalc" "Walc" "health"
## [31] "absences" "G1" "G2" "G3" "alc_use"
## [36] "high_use"
In this exercise, I am interested in studying the effect of students’ gender, age, health, and final grade (G3) on students’ alcohol consumption. Since the dependent variable is categorical (high vs. low consumption), logistic regression analysis is used. The following are my hypotheses:
H1. Male students are more likely to consume high amounts of alcohol compared with their female classmates.
H2. There is a positive relationship between age and alcohol consumption; that is, the older the students, the higher the likelihood of high alcohol consumption.
H3. There is a negative relationship between students’ health and high alcohol consumption.
H4. There is a negative relationship between students’ grades and high alcohol consumption.
Let’s look at the distribution of the selected variables:
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## # A tibble: 4 x 4
## # Groups: sex [?]
## sex high_use count mean_grade
## <fctr> <lgl> <int> <dbl>
## 1 F FALSE 156 11.39744
## 2 F TRUE 42 11.71429
## 3 M FALSE 112 12.20536
## 4 M TRUE 72 10.27778
Now we can see that the number of female students who have not reported high alcohol consumption (FALSE) is larger than the number who have (TRUE), although the mean grade of the former group is actually slightly lower than that of the latter (~11.40 vs. ~11.71). Male non-high-users likewise outnumber male high users; the differences are that the number of male students who reported high alcohol consumption is larger than that of female students, and the gap between their grades is more substantial (~10.28 for high users vs. ~12.21 for those who do not drink heavily).
Now let’s look at some boxplot of the variables
g1 <- ggplot(alc, aes(x = high_use, y = G3, col = sex))
g1 + geom_boxplot() + ylab("grade")
g2 <- ggplot(alc, aes(x = high_use, y = health, col = sex)) + ggtitle("Student health by alcohol consumption and sex")
g2 + geom_boxplot() + ylab("health")
g3 <- ggplot(alc, aes(x = high_use, y = age)) + ggtitle("Student age by alcohol consumption")
g3 + geom_boxplot() + ylab("age")
In this section, let us look at the logistic model, wherein alcohol consumption is the dependent variable, which is binary (high use: TRUE/FALSE), while gender, health, age, and grade are the independent variables.
m <- glm(high_use ~ sex + G3 + health + age, data = alc, family = "binomial")
printing out the model:
m
##
## Call: glm(formula = high_use ~ sex + G3 + health + age, family = "binomial",
## data = alc)
##
## Coefficients:
## (Intercept) sexM G3 health age
## -3.78271 0.87393 -0.07177 0.05017 0.18628
##
## Degrees of Freedom: 381 Total (i.e. Null); 377 Residual
## Null Deviance: 465.7
## Residual Deviance: 440.9 AIC: 450.9
coefficients and the summary of the model
coef(m)
## (Intercept) sexM G3 health age
## -3.78270813 0.87393096 -0.07177345 0.05016889 0.18627672
summary(m)
##
## Call:
## glm(formula = high_use ~ sex + G3 + health + age, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5671 -0.8564 -0.6731 1.2261 2.0567
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.78271 1.86027 -2.033 0.042011 *
## sexM 0.87393 0.23551 3.711 0.000207 ***
## G3 -0.07177 0.03560 -2.016 0.043810 *
## health 0.05017 0.08613 0.583 0.560228
## age 0.18628 0.10005 1.862 0.062639 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 440.87 on 377 degrees of freedom
## AIC: 450.87
##
## Number of Fisher Scoring iterations: 4
Now, let us compute the odds ratios (OR)
OR <- coef(m) %>% exp
computing confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
Printing out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.02276097 0.0005652066 0.8480239
## sexM 2.39631217 1.5169185615 3.8248624
## G3 0.93074173 0.8672495153 0.9975278
## health 1.05144866 0.8892957060 1.2474625
## age 1.20475559 0.9914394848 1.4692182
As can be seen from the results, gender has the widest confidence interval.
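Since the model is fitted on the log-odds scale, each odds ratio above is just the exponential of the corresponding coefficient; for example, for sexM:

```r
# Odds ratio = exp(coefficient); sexM coefficient taken from the output above
exp(0.87393)  # ~2.396, the OR printed for sexM
```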
Hypotheses test results:
The results reveal that gender is a significant variable. According to this and the distribution of this variable shown earlier, male students are more likely to drink high amounts of alcohol. This confirms my 1st hypothesis.
There is a positive relationship between students’ age and high alcohol consumption, though it is significant only at the 0.1 level (p ≈ 0.063), lending tentative support to the 2nd hypothesis.
There is no significant relationship between students’ health and high alcohol consumption, so my 3rd hypothesis is rejected. This could be because the students are very young, and alcohol consumption may not have affected their health yet.
There is a negative, significant relationship between students’ grade and high alcohol consumption, confirming the 4th hypothesis.
Using the variables which were statistically significant in the previous model, hence removing “health”. Let’s call this model m2.
m2 <- glm(high_use ~ sex + G3 + age, data = alc, family = "binomial")
looking at the predictive power of m2
probabilities <- predict(m2, type = "response")
adding the predicted probabilities to ‘alc’
alc <- mutate(alc, probability = probabilities)
using the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
Looking at the last ten original classes, predicted probabilities, and class predictions
select(alc, sex, G3, age, high_use, probability, prediction) %>% tail(10)
## sex G3 age high_use probability prediction
## 373 M 0 19 FALSE 0.6972126 TRUE
## 374 M 2 18 TRUE 0.6236442 TRUE
## 375 F 12 18 FALSE 0.2445347 FALSE
## 376 F 8 18 FALSE 0.3033281 FALSE
## 377 F 5 19 FALSE 0.3945166 FALSE
## 378 F 12 18 FALSE 0.2445347 FALSE
## 379 F 4 18 FALSE 0.3693463 FALSE
## 380 F 4 18 FALSE 0.3693463 FALSE
## 381 M 13 17 TRUE 0.3796480 FALSE
## 382 M 10 18 TRUE 0.4780369 FALSE
tabulating the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 260 8
## TRUE 102 12
displaying the actual values and the predictions
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
g + geom_point()
Computing the total proportion of inaccurately classified individuals
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2879581
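The same figure can be read straight off the confusion table above: the loss is the share of misclassified students, i.e. the off-diagonal cells over the total.

```r
# Misclassification rate from the cross-tabulation above
wrong <- 8 + 102             # false positives + false negatives
total <- 260 + 8 + 102 + 12  # all 382 students
wrong / total                # ~0.288, matching loss_func's 0.2879581
```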
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
## [1] 0.2853403
After performing this 10-fold cross-validation, we can see that the prediction error is larger than that of the model in DataCamp (~0.28 > 0.26). Therefore, it seems that this model does not perform better. So, let’s remove the variable “age” and keep only “sex” and “G3”, which are more statistically significant. Let’s call the model m3:
m3 <- glm(high_use ~ sex + G3, data = alc, family = "binomial")
Now, let’s look at the prediction error of m3:
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m3, K = 10)
cv$delta[1]
## [1] 0.3167539
The prediction error is still large. So, let’s change the explanatory variables: keep “sex”, add “absences”, and remove “G3”.
m4 <- glm(high_use ~ sex + absences, data = alc, family = "binomial")
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m4, K = 10)
cv$delta[1]
## [1] 0.2565445
As can be seen, the prediction error of this model is now smaller than those of the previous models. It is, however, the same as the one in the DataCamp exercise.
So, let’s modify our hypotheses and develop a new logistic regression model, with different variables, to test whether our model improves. Accordingly, in the following, I introduce new explanatory variables to the model while keeping the dependent variable (high_use) the same. I start with a large number of variables and then reduce it.
m5 - explanatory variables: sex, age, studytime, failures, famsup (family educational support), goout (going out with friends), famrel (quality of family relationships), health, and absences.
m5 <- glm(high_use ~ sex + age + studytime + failures + famsup + goout + famrel + health + absences, data = alc, family = "binomial")
Now, let’s look at a summary of the model as well as the coefficients.
summary(m5)
##
## Call:
## glm(formula = high_use ~ sex + age + studytime + failures + famsup +
## goout + famrel + health + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8993 -0.7472 -0.4766 0.7235 2.6787
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.01069 2.04454 -1.962 0.049802 *
## sexM 0.80299 0.27398 2.931 0.003381 **
## age 0.10516 0.11619 0.905 0.365403
## studytime -0.38004 0.17897 -2.124 0.033712 *
## failures 0.12425 0.22065 0.563 0.573353
## famsupyes 0.02573 0.27122 0.095 0.924417
## goout 0.75279 0.12531 6.008 1.88e-09 ***
## famrel -0.40344 0.14372 -2.807 0.004999 **
## health 0.11984 0.09691 1.237 0.216238
## absences 0.07304 0.02201 3.318 0.000905 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 370.92 on 372 degrees of freedom
## AIC: 390.92
##
## Number of Fisher Scoring iterations: 4
coef(m5)
## (Intercept) sexM age studytime failures famsupyes
## -4.01069222 0.80299478 0.10516426 -0.38004369 0.12425443 0.02573115
## goout famrel health absences
## 0.75278867 -0.40344292 0.11984299 0.07303669
Let us now look at the predictive power of the model, by estimating the prediction error:
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m5, K = 10)
cv$delta[1]
## [1] 0.2303665
As the model summary reveals, the variables sex, studytime, goout, famrel, and absences have statistically significant relationships with alcohol consumption (high_use). The prediction error of this model is also smaller than those of the previous models, and smaller than that of the DataCamp model. Therefore, this is a better model in general.
Now let us look at boxplots of some of the variables in the model:
g5 <- ggplot(alc, aes(x = high_use, y = studytime, col = sex)) + ggtitle("Student study time by alcohol consumption and sex")
g5 + geom_boxplot() + ylab("study time")
g6 <- ggplot(alc, aes(x = high_use, y = famrel, col = sex))
g6 + geom_boxplot() + ylab("family relationship")
Now let’s make the model even better by keeping only significant predictors (reducing variables), and call it m6.
m6 <- glm(high_use ~ sex + studytime + goout + famrel + absences, data = alc, family = "binomial")
summary(m6)
##
## Call:
## glm(formula = high_use ~ sex + studytime + goout + famrel + absences,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7526 -0.7780 -0.4612 0.7265 2.6527
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.89648 0.75069 -2.526 0.011526 *
## sexM 0.84545 0.26941 3.138 0.001700 **
## studytime -0.40578 0.17223 -2.356 0.018471 *
## goout 0.76217 0.12333 6.180 6.41e-10 ***
## famrel -0.37891 0.14038 -2.699 0.006950 **
## absences 0.07581 0.02213 3.425 0.000615 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 373.98 on 376 degrees of freedom
## AIC: 385.98
##
## Number of Fisher Scoring iterations: 4
coef(m6)
## (Intercept) sexM studytime goout famrel absences
## -1.89648456 0.84544998 -0.40577857 0.76217262 -0.37890553 0.07581093
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m6, K = 10)
cv$delta[1]
## [1] 0.2277487
As the results show, the prediction error of the model is now smaller (~0.23), hence a better model. All the variables in this model are now significant. The relationships of these variables can be explained as follows:
Sex (gender) is positively related to alcohol consumption; stated differently, male students are more likely to consume high amounts of alcohol compared with female students.
Study time is negatively correlated with high alcohol consumption. This is logical: the more students drink alcohol, the less time they spend studying.
Going out with friends (goout) is positively and strongly correlated with high alcohol consumption. This is not surprising, as students who go out with their friends more often are more likely to drink high amounts of alcohol.
Interestingly, and as expected, the quality of family relationships (famrel) is negatively correlated with high alcohol consumption. This means that students living with a supportive family, wherein they enjoy good social relationships, are less likely to drink high amounts of alcohol. For example, when facing stressful circumstances, instead of drinking they may socialize with their family members, which helps reduce stress.
Finally, as also anticipated, the number of school absences is positively correlated with alcohol consumption. For instance, a hangover in the morning may be one reason for a high number of absences.
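To put these directions of association on an interpretable scale, the m6 coefficients printed above can be exponentiated into odds ratios (values copied from the coef(m6) output):

```r
# Odds ratios for m6, computed from the coefficients shown above
round(exp(c(sexM = 0.84545, studytime = -0.40578, goout = 0.76217,
            famrel = -0.37891, absences = 0.07581)), 2)
# e.g. sexM ~2.33: males have roughly 2.3 times the odds of high alcohol use
```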
In this exercise, we use the Boston data, which ships with R’s MASS package. But first, let’s load some libraries we will need during the exercise.
library(tidyr)
library(ggplot2)
library(corrplot)
## corrplot 0.84 loaded
library(GGally)
After creating the R Markdown file, we load and explore the Boston dataset.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
According to the structure of the data, the Boston dataset has 506 observations (rows) and 14 variables (columns). The data were mainly gathered to study housing values in the suburbs of Boston and how they relate to different variables such as the crime rate.
pairs(Boston)
The plot above shows the pairwise relationships of all the variables in the dataset. But let’s look at a more advanced plot to see the distributions of the data as well as the correlations of the variables.
p <- ggpairs(Boston, mapping = aes(alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
The above plot demonstrates the distribution of the data as well as the correlations of the variables. As an example, data in the variable “rm”, the average number of rooms per dwelling, are normally distributed, and there is a positive, rather strong, correlation between “zn” (proportion of residential land zoned for lots over 25,000 sq. ft.) and “dis” (weighted mean of distances to five Boston employment centres).
But we can look at the correlations with more R functions. Let’s use cor(), which provides a handy, easy-to-interpret correlation matrix; here rounded to two digits.
cor_matrix <- cor(Boston) %>% round(digits = 2)
cor_matrix
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## crim 1.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.6 0.6 0 0 0.0
## zn 0.0 1.0 0.0 0 0.0 0.0 0.0 0.7 0.0 0.0 0 0 0.0
## indus 0.0 0.0 1.0 0 0.8 0.0 0.6 0.0 0.6 0.7 0 0 0.6
## chas 0.0 0.0 0.0 1 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.0
## nox 0.0 0.0 0.8 0 1.0 0.0 0.7 0.0 0.6 0.7 0 0 0.6
## rm 0.0 0.0 0.0 0 0.0 1.0 0.0 0.0 0.0 0.0 0 0 0.0
## age 0.0 0.0 0.6 0 0.7 0.0 1.0 0.0 0.0 0.5 0 0 0.6
## dis 0.0 0.7 0.0 0 0.0 0.0 0.0 1.0 0.0 0.0 0 0 0.0
## rad 0.6 0.0 0.6 0 0.6 0.0 0.0 0.0 1.0 0.9 0 0 0.0
## tax 0.6 0.0 0.7 0 0.7 0.0 0.5 0.0 0.9 1.0 0 0 0.5
## ptratio 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 1 0 0.0
## black 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0 1 0.0
## lstat 0.0 0.0 0.6 0 0.6 0.0 0.6 0.0 0.0 0.5 0 0 1.0
## medv 0.0 0.0 0.0 0 0.0 0.7 0.0 0.0 0.0 0.0 0 0 0.0
## medv
## crim 0.0
## zn 0.0
## indus 0.0
## chas 0.0
## nox 0.0
## rm 0.7
## age 0.0
## dis 0.0
## rad 0.0
## tax 0.0
## ptratio 0.0
## black 0.0
## lstat 0.0
## medv 1.0
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
Let’s explore some examples. We can see that there is a strong, positive relationship between “rad” (index of accessibility to radial highways) and “tax” (full-value property-tax rate per $10,000). In addition, there is a strong, negative relationship between “lstat” (percentage of lower-status population) and “medv” (median value of owner-occupied homes in $1000s).
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
class(boston_scaled)
## [1] "matrix"
boston_scaled <- as.data.frame(boston_scaled)
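As a reminder of what scale() actually does (a quick synthetic check of my own): each column is standardized by subtracting its mean and dividing by its standard deviation, which is why every variable above now has mean 0.

```r
# scale() standardizes column-wise: (x - mean(x)) / sd(x)
x <- c(2, 4, 6, 8)
all.equal(as.vector(scale(x)), (x - mean(x)) / sd(x))  # TRUE
```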
Creating a categorical variable of the crime rate in the Boston dataset. First, let’s create a quantile vector of crime rate.
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
Now, let’s create a categorical variable, and call it, ‘crime’:
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
Now, let’s remove the original “crim” variable from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
And add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
Dividing the dataset into train and test sets, so that 80% of the data belongs to the train set.
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
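One caveat about the split above: sample() draws a different random split on every knit, so for reproducible results the RNG seed should be fixed first (the seed value 123 below is my own arbitrary choice):

```r
# Reproducible 80/20 split: fix the seed before sampling row indices
set.seed(123)
n <- 506                          # nrow(Boston)
ind <- sample(n, size = floor(n * 0.8))
length(ind)                       # 404 training rows, no duplicates
```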
Using crime as the target variable and all the other variables as predictors.
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2574257 0.2500000 0.2648515 0.2277228
##
## Group means:
## zn indus chas nox rm
## low 0.95759563 -0.8830912 -0.08304540 -0.8716491 0.47342261
## med_low -0.08372813 -0.3053341 -0.07742312 -0.5786489 -0.13436431
## med_high -0.38144987 0.1560976 0.20601021 0.3204821 0.09301832
## high -0.48724019 1.0173395 -0.01556166 1.0623135 -0.38200908
## age dis rad tax ptratio
## low -0.8753564 0.8935254 -0.6859201 -0.7404021 -0.44891649
## med_low -0.3815518 0.3658397 -0.5509118 -0.4739064 -0.07046981
## med_high 0.4113928 -0.3540647 -0.3893910 -0.2862663 -0.21688873
## high 0.7913321 -0.8428187 1.6358846 1.5124511 0.77816452
## black lstat medv
## low 0.37428075 -0.776075873 0.52764115
## med_low 0.30892423 -0.132893945 0.01042791
## med_high 0.08795252 -0.007888781 0.14936612
## high -0.86949637 0.851746135 -0.69057439
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.11779753 0.731463897 -0.92835678
## indus -0.02065742 -0.148027594 0.24204698
## chas -0.06077876 -0.064186272 -0.02753877
## nox 0.42238188 -0.721138801 -1.29414811
## rm -0.10185031 -0.030882054 -0.23523583
## age 0.30602342 -0.320551750 -0.32766210
## dis -0.03839524 -0.237272725 0.13836883
## rad 2.92298438 1.026644629 -0.06675228
## tax 0.04388262 -0.137621218 0.57804373
## ptratio 0.12037971 0.007228732 -0.24657288
## black -0.15262600 0.013093565 0.10307866
## lstat 0.22468947 -0.226852584 0.61874175
## medv 0.20746997 -0.440769729 0.04795001
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9431 0.0415 0.0153
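The “Proportion of trace” line shows that the first discriminant (LD1) captures about 94% of the between-group variance. These proportions can be recomputed from the singular values stored in the fitted model; a quick sketch:

```r
# The proportion of trace equals the squared singular values of the
# discriminants, normalized to sum to one
round(lda.fit$svd^2 / sum(lda.fit$svd^2), 4)
```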
Drawing the LDA (bi)plot. But first, let’s create a numeric vector of the train set’s crime classes.
classes <- as.numeric(train$crime)
Now we’re ready to draw the LDA plot. First, defining a helper function that draws arrows for the variable coefficients.
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[,choices[1]],
         y1 = myscale * heads[,choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 3)
In part 4, the categorical crime variable was already removed from the test dataset, so we now have the test data and the correct class labels. Next, we predict the classes with the LDA model on the test data.
lda.pred <- predict(lda.fit, newdata = test)
Cross tabulating the results by creating a table of the correct classes and the predicted ones.
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 15 7 1 0
## med_low 2 14 9 0
## med_high 0 4 15 0
## high 0 0 0 35
The table above reveals that in most cases the crime rate class has been predicted correctly, but there are some mismatches between the predicted and the correct classes. The high crime rate class is predicted most precisely: all 35 of its observations are classified correctly.
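The overall prediction accuracy can be summarized from the diagonal of the cross tabulation; for the table above this gives (15 + 14 + 15 + 35) / 102, about 0.77:

```r
# Overall accuracy: share of correctly predicted observations (table diagonal)
tab <- table(correct = correct_classes, predicted = lda.pred$class)
sum(diag(tab)) / sum(tab)
```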
First, reloading and standardizing the dataset.
data('Boston')
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
Now, let’s use the dist() function to calculate the distances between observations using the most common distance measure, the Euclidean distance.
dist_eu <- dist(Boston)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.119 85.624 170.539 226.315 371.950 626.047
Next, let’s compute a Manhattan distance matrix, based on another common distance measure.
dist_man <- dist(Boston, method = "manhattan")
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.016 149.145 279.505 342.899 509.707 1198.265
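The difference between the two measures can be illustrated with two toy points: Euclidean distance is the straight-line distance, while Manhattan distance sums the absolute differences along each coordinate.

```r
# Euclidean vs. Manhattan distance between the points (0, 0) and (3, 4)
pts <- rbind(c(0, 0), c(3, 4))
dist(pts)                        # sqrt(3^2 + 4^2) = 5
dist(pts, method = "manhattan")  # |3 - 0| + |4 - 0| = 7
```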
Next, let’s explore the clustering structure with K-means. First, we run the K-means algorithm on the data with four clusters and visualize the result.
km <- kmeans(Boston, centers = 4)
pairs(Boston, col = km$cluster)
Now, we investigate the optimal number of clusters, trying up to 10. One way to choose the number of clusters is to look at how the total within-cluster sum of squares (WCSS) behaves when the number of clusters changes. So, let’s compute the total WCSS for 1–10 clusters and plot it.
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
By visualizing the total WCSS as a graph, we can see that two clusters would seem optimal, because a good number of clusters is found where the total WCSS drops most sharply.
So, let’s run k-means again with two clusters and then visualize the results.
km <- kmeans(Boston, centers = 2)
pairs(Boston, col = km$cluster)
For the 3D plot, selecting the model predictors (all variables except crime) from the train set
model_predictors <- dplyr::select(train, -crime)
Checking the dimensions of the predictor data and of the LDA scaling matrix
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
Multiplying the predictor matrix with the LDA scaling matrix (the coefficients of the linear discriminants)
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
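A possible extension (a sketch, not run above) is to color the 3D points by the crime classes of the train set, using plot_ly’s color argument, so that the class structure found by the LDA becomes visible in the scatter:

```r
# Coloring the 3D scatter by the crime classes of the training set
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2,
        z = matrix_product$LD3, type = 'scatter3d', mode = 'markers',
        color = train$crime)
```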